{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Files\n",
    "\n",
    "**Time**\n",
    "- Teaching: 10 min\n",
    "- Exercises: 5 min\n",
    "\n",
    "**Questions**:\n",
    "- \"How do I open a file and read its contents?\"\n",
    "- \"How do I write a file with the variables I generated?\"\n",
    "\n",
    "**Learning Objectives**:\n",
    "- \"Learn the Pythonic way of reading in files.\"\n",
    "- \"Understand how to read/write text files and csv files.\"\n",
    "* * * * *\n",
    "\n",
    "In this lesson we will cover how to read and write files."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reading from a file\n",
    "\n",
    "Reading a file requires three steps:\n",
    "\n",
    "1. Opening the file\n",
    "2. Reading the file\n",
    "3. Closing the file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_file = open(\"example.txt\", \"r\")\n",
    "text = my_file.read()\n",
    "my_file.close()\n",
    "\n",
    "print(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- However, it is better to use the `with open` syntax, which closes the file for you automatically, even if an error occurs inside the block.\n",
    "- The `'r'` indicates that you are opening the file for reading, as opposed to, say, writing to it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# better code\n",
    "with open('example.txt', 'r') as my_file:\n",
    "    text = my_file.read()\n",
    "    \n",
    "print(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`with` keeps the file open for as long as the program is inside the indented block. Once the block ends, the file is closed and you can no longer read from it; you can only use what you have already saved to a variable."
   ]
  },
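  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick check of this behavior, the sketch below writes a small throwaway file (the name `with_demo.txt` is made up for this illustration), reads it inside a `with` block, and then inspects the file object's `closed` attribute. Trying to read after the block raises a `ValueError`, while the text saved to a variable is still available."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create a small throwaway file so this cell runs on its own\n",
    "with open('with_demo.txt', 'w') as demo_file:\n",
    "    demo_file.write('first line\\nsecond line\\n')\n",
    "\n",
    "with open('with_demo.txt', 'r') as demo_file:\n",
    "    demo_text = demo_file.read()\n",
    "    print(demo_file.closed)  # checked while still inside the block\n",
    "\n",
    "print(demo_file.closed)  # checked after the block has ended\n",
    "\n",
    "try:\n",
    "    demo_file.read()  # the file is closed, so this fails\n",
    "except ValueError as error:\n",
    "    print('Cannot read:', error)\n",
    "\n",
    "print(demo_text)  # the saved variable is still usable"
   ]
  },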
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reading a file as a list\n",
    "\n",
    "- Very often we want to read in a file line by line, storing those lines as a list.\n",
    "- To do that, we can use the `for line in my_file` syntax:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "stored = []\n",
    "with open('example.txt', 'r') as my_file:\n",
    "    for line in my_file:\n",
    "        stored.append(line)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "stored"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remember that the variable name can be anything; it does not have to be `line`. Looping over a file object always gives you one line at a time, and each line still includes its trailing line break (`\\n`)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- We can use the `strip` [method](https://github.com/dlab-berkeley/python-intensive/blob/master/Glossary.md#method) to get rid of those line breaks at the end. Note that `strip` also removes any other whitespace at the start and end of each line."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "stored = []\n",
    "with open('example.txt', 'r') as my_file:\n",
    "    for line in my_file:\n",
    "        stored.append(line.strip())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "stored"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Read a text file in one line\n",
    "\n",
    "You can also chain `.read()` directly onto the file object that `open` returns. Note that this shortcut never closes the file explicitly, so prefer `with open` in longer programs. Let's import the \"fiji2014.txt\" file from the Day 4 data/txts folder:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_text = open(\"../Day_4/data/txts/fiji2014.txt\", encoding=\"utf-8\").read()\n",
    "# print(my_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Writing to a file\n",
    "\n",
    "We can use the `with open` syntax for writing files as well. Opening a file with mode `'w'` creates it if it does not exist and overwrites it if it does."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# this is okay...\n",
    "new_file = open(\"example2.txt\", \"w\")\n",
    "bees = ['bears', 'beets', 'Battlestar Galactica']\n",
    "for i in bees:\n",
    "    new_file.write(i + '\\n')\n",
    "new_file.close()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# but this is better...\n",
    "bees = ['bears', 'beets', 'Battlestar Galactica']\n",
    "with open('example2.txt', 'w') as new_file:\n",
    "    for i in bees:\n",
    "        new_file.write(i + '\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's take a look at the file we wrote.\n",
    "- In a notebook, starting a line with an exclamation point `!` runs the rest of that line as a shell command."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# on macOS and Linux, use the `cat` command\n",
    "!cat example2.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# on Windows, use the `type` command\n",
    "!type example2.txt"
   ]
  },
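  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If neither shell command works on your machine, a portable alternative is to read the file back with Python itself. The sketch below is self-contained: it writes the same three lines to a hypothetical file named `example2_check.txt` (a name chosen only for this illustration) and then reads them back into a list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "items = ['bears', 'beets', 'Battlestar Galactica']\n",
    "\n",
    "# write one item per line\n",
    "with open('example2_check.txt', 'w') as check_file:\n",
    "    for item in items:\n",
    "        check_file.write(item + '\\n')\n",
    "\n",
    "# read the lines back, dropping the trailing line breaks\n",
    "written_lines = []\n",
    "with open('example2_check.txt', 'r') as check_file:\n",
    "    for line in check_file:\n",
    "        written_lines.append(line.strip())\n",
    "\n",
    "print(written_lines)"
   ]
  },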
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Reading/Writing csv files using `pandas`\n",
    "\n",
    "Reading in a dataset that is stored as a \"comma-separated values\" (csv) file is easy in Python using the `pandas` package. Central to the `pandas` package is the `DataFrame` type, which stores 2-dimensional tabular data in a format similar to Excel spreadsheets.\n",
    "\n",
    "Let's import `pandas` and use its `read_csv()` function to load the data stored in a csv file into a `DataFrame`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# You might need to install the pandas library first.\n",
    "# Uncomment the line below and run this cell to install it:\n",
    "# !pip install pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "caps = pd.read_csv('capitals.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can look at the first 5 (or any number) rows of data using the `.head()` method of the `DataFrame` object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "caps.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see how many data points and variables exist in the dataframe we can simply use the `.shape` attribute."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "caps.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or we can get more detailed information about the number of entries (e.g. observations, data points) and the variables for each entry using the `.info()` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "caps.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It looks like there is a single missing value in the Capital variable (there are 199 non-null objects, not 200). Let's remove the row with that missing value (or `na`) using the `dropna()` method, which drops every row that contains at least one missing value, so that we can save an updated version of the csv file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "caps_nomissing = caps.dropna()\n",
    "caps_nomissing.info()"
   ]
  },
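  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That looks better. Now let's write this updated `DataFrame` out to a csv file. By default, `to_csv` also writes the row index as the first column; pass `index=False` if you want to leave it out."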
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "caps_nomissing.to_csv('capitals_nomissing.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For more information on using `pandas` come to the D-Lab's workshop titled \"Introduction to Pandas\". Here's a [link](https://github.com/dlab-berkeley/introduction-to-pandas) to the GitHub repo containing the course materials."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Challenge 1: Read in a list\n",
    "\n",
    "The file `counties.txt` has a column of counties in California. Read the data into a list called `counties`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Challenge 2: Writing a CSV file\n",
    "\n",
    "Below is a `pandas` `DataFrame` created from a dictionary of lists representing various information about US states. Write this [object](https://github.com/dlab-berkeley/python-intensive/blob/master/Glossary.md#object) to a CSV file called `states.csv`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np  # needed for np.nan below\n",
    "\n",
    "states = pd.DataFrame( {'state': ['Ohio', 'Michigan', 'California', 'Florida', 'Alabama'],\n",
    "                        'population': [11.6, 9.9, 39.1, 20.2, 4.9],\n",
    "                        'year in union': [1803, 1837, 1850, 1845, 1819],\n",
    "                        'state bird': ['Northern cardinal', np.nan, np.nan, np.nan, np.nan], \n",
    "                        'capital': ['Columbus', 'Lansing', 'Sacramento', 'Tallahassee', 'Montgomery']})\n",
    "states"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## [OPTIONAL] Using the CSV Module\n",
    "\n",
    "In addition to reading csv files with `pandas`, Python's built-in `csv` module can read csv files into lists and dictionaries.\n",
    "- A common approach is to read a csv file as a list of dictionaries, one dictionary per row.\n",
    "- For this, we use the `csv` module's `DictReader`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# read the csv file into a list of dictionaries\n",
    "capitals = [] # make empty list\n",
    "with open('capitals.csv', 'r', newline='') as csvfile: # open file (newline='' is recommended for the csv module)\n",
    "    reader = csv.DictReader(csvfile) # create a reader\n",
    "    for row in reader: # loop through rows\n",
    "        capitals.append(row) # append each row to the list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "capitals[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Writing a list of dictionaries as a CSV is similar:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# get the column names (keys) from the first dictionary\n",
    "keys = capitals[0].keys()\n",
    "keys"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# write rows (newline='' avoids blank lines between rows on Windows)\n",
    "with open('capitals2.csv', 'w', newline='') as output_file:\n",
    "    dict_writer = csv.DictWriter(output_file, keys)\n",
    "    dict_writer.writeheader()\n",
    "    dict_writer.writerows(capitals)"
   ]
  },
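  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To try the `csv` module without `capitals.csv`, here is a minimal self-contained sketch. The two rows and the file name `pets_demo.csv` are invented for illustration. It writes a list of dictionaries with `csv.DictWriter` and reads it back with `csv.DictReader`. Note that every value comes back as a string, so numbers need to be converted after reading."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "\n",
    "# invented example rows, one dictionary per row\n",
    "pets = [{'name': 'Rex', 'age': 4}, {'name': 'Tom', 'age': 7}]\n",
    "\n",
    "with open('pets_demo.csv', 'w', newline='') as out_file:\n",
    "    writer = csv.DictWriter(out_file, fieldnames=['name', 'age'])\n",
    "    writer.writeheader()  # first line holds the column names\n",
    "    writer.writerows(pets)\n",
    "\n",
    "pets_read = []\n",
    "with open('pets_demo.csv', 'r', newline='') as in_file:\n",
    "    for row in csv.DictReader(in_file):\n",
    "        pets_read.append(dict(row))\n",
    "\n",
    "print(pets_read)"
   ]
  },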
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "csv.DictWriter.writerows?"
   ]
  },
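  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# roughly what writerows does, written by hand (simplified: fields are not quoted)\n",
    "with open('capitals3.csv', 'w') as output_file:  # 'capitals3.csv' is just an example name\n",
    "    for cur_observation_dict in capitals:\n",
    "        cur_line = []\n",
    "        for cur_key in keys:\n",
    "            cur_line.append(str(cur_observation_dict[cur_key]))\n",
    "        output_file.write(','.join(cur_line) + '\\n')"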
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "calc(100% - 180px)",
    "left": "10px",
    "top": "150px",
    "width": "271.6px"
   },
   "toc_section_display": true,
   "toc_window_display": false
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
